In this report, I will analyze 3 indicators of the Gapminder Dataset:
Each dataset contains the evolution of one indicator over time, from 1800 to today, for many countries.
In this report, I will try to answer the following questions:
# Import
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
print("CO2 DataFrame\n")
# Load CO2 csv file into a Pandas DataFrame
df_co2 = pd.read_csv('co2_emissions_tonnes_per_person.csv', index_col='country')
# Rename the index and the column
df_co2.columns.name = "Year"
df_co2.index.name = "Country"
# Display info about the dataframe
print(df_co2.info())
df_co2.head()
print("Income DataFrame\n")
# Load Income csv file into a Pandas DataFrame
df_income = pd.read_csv('income_per_person_gdppercapita_ppp_inflation_adjusted.csv', index_col='country')
# Rename the index and the column
df_income.columns.name = "Year"
df_income.index.name = "Country"
# Display info about the dataframe
print(df_income.info())
df_income.head()
print("Life expectancy DataFrame\n")
# Load Life Expectancy csv file into a Pandas DataFrame
df_life_exp = pd.read_csv('life_expectancy_years.csv', index_col='country')
# Rename the index and the column
df_life_exp.columns.name = "Year"
df_life_exp.index.name = "Country"
# Display info about the dataframe
print(df_life_exp.info())
df_life_exp.head()
From the information displayed above, we can notice that:
In this section, I will clean the data. To do so, I will:
I created a Python file containing all the cleaning functions because they are the same for each indicator DataFrame.
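The `data_cleaning` module itself is not shown in this report. As a rough illustration only, the sketch below shows what functions with those names could look like. The bodies are assumptions, not the actual module. In particular, the gap-filling strategy in `fillNaN` (forward fill, then backward fill, along each country's row) and the `last_year` parameter are guesses. The toy frame `raw` uses made-up values.

```python
import datetime

import numpy as np
import pandas as pd


def columns2Int(df):
    """Convert year labels such as '1800' to integers, in place."""
    df.columns = pd.Index([int(label) for label in df.columns],
                          name=df.columns.name)


def removeYearsInFuture(df, last_year=None):
    """Drop the columns for years after last_year, in place.

    last_year is a hypothetical parameter; it defaults to the current year.
    """
    if last_year is None:
        last_year = datetime.date.today().year
    future = [year for year in df.columns if year > last_year]
    df.drop(columns=future, inplace=True)


def fillNaN(df):
    """Assumed strategy: fill gaps from neighboring years, in place."""
    filled = df.ffill(axis=1).bfill(axis=1)
    df.iloc[:, :] = filled.to_numpy()


def values2Float(df):
    """Return a copy of the DataFrame with every value cast to float."""
    return df.astype(float)


# Toy data (made-up values) with string year labels and a projected year
raw = pd.DataFrame(
    {"1800": [1.0, np.nan], "1801": [np.nan, 2.0], "2999": [3.0, 4.0]},
    index=pd.Index(["A", "B"], name="Country"),
)
columns2Int(raw)
removeYearsInFuture(raw)
fillNaN(raw)
raw = values2Float(raw)
print(raw)
```

The first three functions modify the DataFrame in place, while `values2Float` returns a new one, which matches how they are called in this report.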
import data_cleaning as clean
clean.columns2Int(df_co2)
clean.removeYearsInFuture(df_co2)
clean.fillNaN(df_co2)
print("CO2 cleaned DataFrame\n")
print(df_co2.info())
print("\nAre there any leftover NaN values: {}".format(df_co2.isna().any().any()))
df_co2.head()
clean.columns2Int(df_income)
clean.removeYearsInFuture(df_income)
clean.fillNaN(df_income)
df_income = clean.values2Float(df_income)
print("Income cleaned DataFrame\n")
print(df_income.info())
print("\nAre there any leftover NaN values: {}".format(df_income.isna().any().any()))
df_income.head()
clean.columns2Int(df_life_exp)
clean.removeYearsInFuture(df_life_exp)
clean.fillNaN(df_life_exp)
print("Life expectancy cleaned DataFrame\n")
print(df_life_exp.info())
print("\nAre there any leftover NaN values: {}".format(df_life_exp.isna().any().any()))
df_life_exp.head()
We can now notice that:
These 3 DataFrames are enough to explore and answer question 1. However, answering question 2 requires some more cleaning and merging. To explore potential correlations between the indicators, we need to combine them. To do so, I first transform each DataFrame into a multi-indexed Series, then concatenate the 3 Series. The result is a multi-indexed DataFrame with (Country, Year) as index and 3 columns (CO2, Income and LifeExp).
print("Multi Index DataFrame\n")
df_multi = pd.concat([df_co2.stack(), df_income.stack(), df_life_exp.stack()], axis=1)
# Assign the column labels first: replacing the columns resets their name
df_multi.columns = ["CO2", "Income", "LifeExp"]
df_multi.columns.name = "Indices"
print(df_multi.info())
df_multi
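To make the stacking step concrete, here is a minimal sketch on made-up numbers. The two toy frames `toy_co2` and `toy_income` and their values are illustrative, not taken from the Gapminder files. It shows how `stack()` turns a wide Country x Year table into a Series indexed by (Country, Year), and how a country missing from one table ends up as NaN after `pd.concat`.

```python
import pandas as pd

years = pd.Index([2000, 2001], name="Year")

# Toy wide tables: one row per country, one column per year (made-up values)
toy_co2 = pd.DataFrame(
    [[6.0, 6.1], [3.0, 3.2]],
    index=pd.Index(["France", "Iraq"], name="Country"),
    columns=years,
)
# "Iraq" is deliberately missing from this second table
toy_income = pd.DataFrame(
    [[10000.0, 20000.0]],
    index=pd.Index(["France"], name="Country"),
    columns=years,
)

# stack() moves the Year columns into a second index level
stacked_co2 = toy_co2.stack()
stacked_income = toy_income.stack()

# Concatenate side by side; keys become the column labels
combined = pd.concat(
    [stacked_co2, stacked_income], axis=1, keys=["CO2", "Income"]
).sort_index()
print(combined)
```

The rows for Iraq have a CO2 value but NaN for Income, because that country only appears in one of the two toy tables.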
As mentioned above, the 3 initial DataFrames do not share the same list of countries in their index. As a result, the final multi-indexed DataFrame contains some NaN values. I decided to simply drop the rows that contain NaN.
print("Multi Index cleaned DataFrame\n")
df_multi.dropna(inplace=True)
print("\nAre there any leftover NaN values: {}".format(df_multi.isna().any().any()))
df_multi.info()
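Since dropping rows discards data, it can be useful to count how many rows are lost. The sketch below does this on a small made-up frame (`toy` and its values are illustrative only). The same pattern could be applied to `df_multi` before it is cleaned.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("France", 2000), ("France", 2001), ("Iraq", 2000)],
    names=["Country", "Year"],
)
# Made-up values; Iraq has no Income value
toy = pd.DataFrame(
    {"CO2": [6.0, 6.1, 3.0], "Income": [30000.0, 31000.0, np.nan]},
    index=index,
)

rows_before = len(toy)
has_nan_before = bool(toy.isna().any().any())

# Keep only the rows where every indicator is available
toy_clean = toy.dropna()

rows_dropped = rows_before - len(toy_clean)
has_nan_after = bool(toy_clean.isna().any().any())
print("Dropped {} of {} rows".format(rows_dropped, rows_before))
```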
Now that all the cleaned DataFrames are available to answer my 2 questions, I can move on to the exploratory phase.
In this section, I will explore the data to analyze the distribution and the evolution over time of the 3 indicators independently from one another. For each indicator, I will explore:
# Global variable used for the data analysis for each DataFrame
years_to_plot = [1800, 1850, 1900, 1950, 2000, 2018]
countries_to_plot = ['Australia', 'Brazil', 'Burkina Faso', 'China', 'France', 'Iraq', 'United States']
df_co2[years_to_plot].plot(kind='hist', subplots=True, layout=(3, 2), title='Distribution of CO2 emissions per person (in tonnes) around the world for some given years', figsize=(15, 10), grid=True, sharey=True, sharex=False);
df_co2[years_to_plot].describe()
import data_exploration as explore
explore.plotLine(df_co2, countries_to_plot, title="Evolution of CO2 emissions per person", ylabel="CO2 emissions (Tonnes)")
df_co2.loc[countries_to_plot].transpose().describe()
df_income[years_to_plot].hist(figsize=(20,20));
#df_income[years_to_plot].plot(kind='hist',subplots=True, layout=(3,2), title='Distribution of Income per person (GDP per capita Inflation adjusted) around the world for some given years', figsize=(15,10), grid=True, sharey=True, sharex=False);
df_income[years_to_plot].describe()
explore.plotLine(df_income, countries_to_plot, title="Evolution of Income per person", ylabel="Income (GDP per capita Inflation adjusted)")
df_income.loc[countries_to_plot].transpose().describe()
df_life_exp[years_to_plot].plot(kind='hist', subplots=True, layout=(3, 2), title='Distribution of life expectancy around the world for some given years', figsize=(15, 10), grid=True, sharey=True, sharex=True);
df_life_exp[years_to_plot].describe()
explore.plotLine(df_life_exp, countries_to_plot, title="Evolution of life expectancy over time", ylabel="Age in Years")
df_life_exp.loc[countries_to_plot].transpose().describe()
In this section, I will explore the multi-indexed DataFrame containing the 3 indicators to answer the second question of my report. To do so, I will:
df_multi.plot(kind='scatter', x='CO2', y='Income', figsize=(8, 8));
df_multi.plot(kind='scatter', x='CO2', y='LifeExp', figsize=(8, 8));
df_multi.plot(kind='scatter', x='LifeExp', y='Income', figsize=(8, 8));
From these 3 plots, we can notice a correlation between CO2 emissions and income, and between income and life expectancy, but no clear correlation between CO2 emissions and life expectancy.
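One possible way to put numbers on this visual impression is a correlation matrix. The sketch below uses made-up values (the `toy` frame is illustrative, not Gapminder data) and pandas' default Pearson coefficient, which only measures linear association. On the real data, the equivalent call would be `df_multi[['CO2', 'Income', 'LifeExp']].corr()`.

```python
import pandas as pd

# Made-up values: Income is exactly proportional to CO2,
# while LifeExp has no clear relation to CO2
toy = pd.DataFrame({
    "CO2": [1.0, 2.0, 3.0, 4.0],
    "Income": [10.0, 20.0, 30.0, 40.0],
    "LifeExp": [60.0, 50.0, 70.0, 55.0],
})

# Pairwise Pearson correlation between the columns
corr = toy.corr()
print(corr.round(2))
```

A coefficient close to 1 or -1 indicates a strong linear relationship and a value near 0 indicates none, so the scatter plots remain useful for spotting non-linear patterns.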
To plot the 3 signals on the same axes and see any trend in their evolution, we first need to normalize each signal by dividing it by its maximum value for each country. This puts the 3 signals on a common scale (up to 1.0) instead of different scales (~50 for LifeExp and CO2 and ~100000 for Income).
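# Divide each indicator by its per-country maximum
# (Series.max(level=...) was removed in pandas 2.0, so use groupby/transform)
df_multi['CO2_ratio'] = df_multi['CO2'] / df_multi.groupby(level='Country')['CO2'].transform('max')
df_multi['Income_ratio'] = df_multi['Income'] / df_multi.groupby(level='Country')['Income'].transform('max')
df_multi['LifeExp_ratio'] = df_multi['LifeExp'] / df_multi.groupby(level='Country')['LifeExp'].transform('max')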
df_multi
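The per-country normalization can be checked by hand on a tiny example. The sketch below uses made-up income values (the `toy` frame is illustrative only) and `groupby(level=...).transform('max')`, which returns each country's maximum repeated on every row of that country, so the division lines up row by row.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [["France", "Iraq"], [2000, 2001]], names=["Country", "Year"]
)
# Made-up values on very different scales for the two countries
toy = pd.DataFrame({"Income": [20000.0, 40000.0, 3000.0, 6000.0]}, index=index)

# Maximum of each country, broadcast back to the original rows
country_max = toy.groupby(level="Country")["Income"].transform("max")

# Each value expressed as a fraction of its own country's maximum
toy["Income_ratio"] = toy["Income"] / country_max
print(toy)
```

Both countries reach 1.0 in the year of their own maximum, even though their absolute incomes differ widely, which is what makes the curves comparable on one plot.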
for country in countries_to_plot:
    df_multi.loc[country].plot(y=['CO2_ratio', 'Income_ratio', 'LifeExp_ratio'], title="Evolution of the ratio of the indices (% of max) in {}".format(country), figsize=(8,8))
From these plots, we can observe that the 3 indicators tend to start increasing at around the same time within each country, but that this starting time differs between countries.
From this analysis, we can conclude that CO2 emissions, life expectancy and income increased globally over the last 200 years. We also observed that there has always been a huge disparity in CO2 emissions and income between the vast majority of countries, which sit in the lower range of those indicators, and a few countries with much higher values. All 3 indicators seem to follow the same evolution over time: for a given country, they all start increasing at around the same time. Finally, we observed a correlation between CO2 emissions and income, and between income and life expectancy.